Continuous Integration Practices in Machine Learning Projects: The Practitioners' Perspective
Bernardo, João Helis, da Costa, Daniel Alencar, Cogo, Filipe Roseiro, de Medeiros, Sérgio Queiróz, Kulesza, Uirá
Continuous Integration (CI) is a cornerstone of modern software development. However, while widely adopted in traditional software projects, applying CI practices to Machine Learning (ML) projects presents distinctive characteristics. For example, our previous work revealed that ML projects often experience longer build durations and lower test coverage rates compared to their non-ML counterparts. Building on these quantitative findings, this study surveys 155 practitioners from 47 ML projects to investigate the underlying reasons for these distinctive characteristics through a qualitative perspective. Practitioners highlighted eight key differences, including test complexity, infrastructure requirements, and build duration and stability. Common challenges mentioned by practitioners include higher project complexity, model training demands, extensive data handling, increased computational resource needs, and dependency management, all contributing to extended build durations. Furthermore, ML systems' non-deterministic nature, data dependencies, and computational constraints were identified as significant barriers to effective testing. The key takeaway from this study is that while foundational CI principles remain valuable, ML projects require tailored approaches to address their unique challenges. To bridge this gap, we propose a set of ML-specific CI practices, including tracking model performance metrics and prioritizing test execution within CI pipelines. Additionally, our findings highlight the importance of fostering interdisciplinary collaboration to strengthen the testing culture in ML projects. By bridging quantitative findings with practitioners' insights, this study provides a deeper understanding of the interplay between CI practices and the unique demands of ML projects, laying the groundwork for more efficient and robust CI strategies in this domain.
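One of the proposed ML-specific practices, tracking model performance metrics within CI pipelines, can be sketched as a simple metric gate that fails the build on regression. The metric, data, and threshold below are illustrative assumptions, not taken from the study:

```python
# Hedged sketch of one proposed CI practice: failing the build when a tracked
# model metric drops below a baseline. Metric, data, and threshold are
# illustrative assumptions, not taken from the study.
def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def test_model_meets_baseline():
    # In a real CI job these would come from evaluating the freshly trained
    # model on a fixed holdout set fetched by the pipeline.
    preds  = [1, 0, 1, 1, 0, 1, 0, 0]
    labels = [1, 0, 1, 0, 0, 1, 0, 0]
    baseline = 0.80  # build fails if accuracy regresses below this
    assert accuracy(preds, labels) >= baseline

test_model_meets_baseline()
print("model performance gate passed")
```

A test like this runs alongside conventional unit tests, so an accuracy regression blocks a merge the same way a failing test would.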
- South America > Brazil > Rio Grande do Norte > Natal (0.04)
- Oceania > New Zealand > South Island > Otago > Dunedin (0.04)
- North America > Canada > Ontario > Kingston (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Software Engineering (1.00)
- Information Technology > Software (1.00)
- Information Technology > Data Science (1.00)
- (2 more...)
MLScent: A Tool for Anti-pattern Detection in ML Projects
Shivashankar, Karthik, Martini, Antonio
Machine learning (ML) codebases face unprecedented challenges in maintaining code quality and sustainability as their complexity grows. While traditional code smell detection tools exist, they fail to address ML-specific issues that can significantly impact model performance, reproducibility, and maintainability. This paper introduces MLScent, a novel static analysis tool that leverages sophisticated Abstract Syntax Tree (AST) analysis to detect anti-patterns and code smells specific to ML projects. MLScent implements 76 distinct detectors across major ML frameworks including TensorFlow (13 detectors), PyTorch (12 detectors), Scikit-learn (9 detectors), and Hugging Face (10 detectors), along with data science libraries like Pandas and NumPy (8 detectors each). Our evaluation demonstrates MLScent's effectiveness through both quantitative classification metrics and qualitative assessment via user study feedback with ML practitioners. Results show high accuracy in identifying framework-specific anti-patterns, data handling issues, and general ML code smells across real-world projects.
The software development landscape has undergone a dramatic transformation with the integration of Machine Learning (ML). Recent statistics from Gartner highlight this shift, revealing a striking 270% increase in ML adoption within enterprise software projects over the last four years [1]. This rapid adoption, however, brings its own set of complexities. Traditional software development practices have had to evolve significantly to accommodate ML's unique requirements, including the need for extensive datasets, sophisticated algorithms, and iterative development cycles [3]. These fundamental differences have catalyzed a complete reimagining of software development methodologies, from initial design through testing and maintenance [4], [5], as also highlighted by Tang et al. [6] in their empirical study of ML systems refactoring and technical debt.
ML projects introduce distinct code quality challenges that set them apart from conventional software development. The complexity stems from their inherent characteristics: intricate mathematical operations, extensive data preprocessing requirements, and sophisticated model architectures that challenge traditional code maintenance approaches [7].
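As a hedged sketch of what AST-based smell detection looks like, the snippet below flags one common ML reproducibility smell: calls to scikit-learn's `train_test_split` that omit a `random_state` seed. The rule and detector are illustrative, not MLScent's actual implementation:

```python
import ast

def find_unseeded_splits(source: str) -> list[int]:
    """Return line numbers of train_test_split calls missing random_state,
    a common reproducibility smell in ML code (illustrative rule only)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # handle both plain names and attribute calls (e.g. ms.train_test_split)
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            if name == "train_test_split":
                kwargs = {kw.arg for kw in node.keywords}
                if "random_state" not in kwargs:
                    hits.append(node.lineno)
    return hits

sample = """
from sklearn.model_selection import train_test_split
a = train_test_split(X, y)                   # smell: unseeded split
b = train_test_split(X, y, random_state=42)  # ok
"""
print(find_unseeded_splits(sample))
```

Real detectors layer many such AST rules per framework; this shows only the mechanism of walking the tree and matching call shapes.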
- North America > United States (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Questionnaire & Opinion Survey (1.00)
- Research Report > New Finding (0.88)
"They've Stolen My GPL-Licensed Model!": Toward Standardized and Transparent Model Licensing
Duan, Moming, Zhao, Rui, Jiang, Linshan, Shadbolt, Nigel, He, Bingsheng
As model parameter sizes reach the billion-level range and their training consumes zettaFLOPs of computation, component reuse and collaborative development are becoming increasingly prevalent in the Machine Learning (ML) community. These components, including models, software, and datasets, may originate from various sources and be published under different licenses, which govern the use and distribution of licensed works and their derivatives. However, commonly chosen licenses, such as GPL and Apache, are software-specific and are not clearly defined or bounded in the context of model publishing. Meanwhile, the reused components may also carry free-content licenses and model licenses, which pose a potential risk of license noncompliance and rights infringement within the model production workflow. In this paper, we propose addressing the above challenges along two lines: 1) For license analysis, we have developed a new vocabulary for ML workflow management and encoded license rules to enable ontological reasoning for analyzing rights granting and compliance issues. 2) For standardized model publishing, we have drafted a set of model licenses that provide flexible options to meet the diverse needs of model publishing. Our analysis tool is built on the Turtle language and the Notation3 reasoning engine, envisioned as a first step toward Linked Open Model Production Data. We have also encoded our proposed model licenses into rules and demonstrated the effects of GPL and other commonly used licenses in model publishing, along with the flexibility advantages of our licenses, through comparisons and experiments.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Finland (0.04)
- North America > United States > New Jersey (0.04)
- (5 more...)
- Information Technology > Communications > Web (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
How do Machine Learning Projects use Continuous Integration Practices? An Empirical Study on GitHub Actions
Bernardo, João Helis, da Costa, Daniel Alencar, de Medeiros, Sérgio Queiroz, Kulesza, Uirá
Continuous Integration (CI) is a well-established practice in traditional software development, but its nuances in the domain of Machine Learning (ML) projects remain relatively unexplored. Given the distinctive nature of ML development, understanding how CI practices are adopted in this context is crucial for tailoring effective approaches. In this study, we conduct a comprehensive analysis of 185 open-source projects on GitHub (93 ML and 92 non-ML projects). Our investigation comprises both quantitative and qualitative dimensions, aiming to uncover differences in CI adoption between ML and non-ML projects. Our findings indicate that ML projects often require longer build durations, and medium-sized ML projects exhibit lower test coverage compared to non-ML projects. Moreover, small and medium-sized ML projects show a higher prevalence of increasing build duration trends compared to their non-ML counterparts. Additionally, our qualitative analysis illuminates the discussions around CI in both ML and non-ML projects, encompassing themes like CI Build Execution and Status, CI Testing, and CI Infrastructure. These insights shed light on the unique challenges faced by ML projects in adopting CI practices effectively.
- Europe > Portugal > Lisbon > Lisbon (0.05)
- South America > Brazil > Rio Grande do Norte > Natal (0.04)
- North America > United States > Virginia (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Successful Machine Learning Development Requires a New Paradigm (Thought Leaders)
Initiatives using machine learning cannot be treated in the same manner as projects involving conventional software. It's imperative to move quickly so that you can test things, fix issues and test them again. In other words, you must be able to fail quickly – and do so early on in the process. Waiting until later in this process to find issues can end up being very expensive and time-consuming. When developing software using the traditional method, you use decision logic.
Enterprise ML Platforms Done Right
Many companies are attempting to speed up the delivery of their machine learning (ML) projects by creating platforms. While a few have succeeded, some have experienced significant failures, and most have ended up somewhere in the middle. This can happen when they address MLOps without first addressing their organizational structure and operating model. In this article, we will explore common pitfalls enterprises encounter when building ML platforms and provide solutions to help overcome these obstacles. We will tackle five common pitfalls enterprises face when getting their platform up and running and propose prescriptive solutions for each. To simplify the language, we will use the term "you" to refer to the team responsible for building and maintaining the platform.
New at Civo Navigate: Making Machine Learning Set up Faster - The New Stack
Of the time it takes to set up a machine learning project, 60% is actually spent performing infrastructure engineering tasks. That compares to 20% doing data engineering, Civo Chief Innovation Officer Josh Mesout, who has launched 300 machine learning (ML) models in the past two and a half years, said at the Civo Navigate conference here on Tuesday. Civo hopes to simplify machine learning infrastructure with a new managed service offering, Kubeflow as a Service, which it says will improve the developer experience and reduce the time and resources required to gain insights from machine learning algorithms. The Kubernetes cloud provider is betting that developers don't want to deal with the infrastructure piece of the ML puzzle. So its new offering will run the infrastructure for ML as a managed service, while supporting open source tools and frameworks. It believes this will make ML more accessible to smaller organizations, which it said are often priced out of ML due to economies of scale.
- North America > United States > New York (0.05)
- North America > United States > Florida > Hillsborough County > Tampa (0.05)
- Europe > Germany > Hesse > Darmstadt Region > Frankfurt (0.05)
What Most People Don't Understand About AI - and The Ultimat
In other words, to say that artificial intelligence (AI) is the next step in enterprise would be an understatement. But while it is well known that AI is the next step forward, myths and misconceptions about AI and its processes still run rampant. In order for AI and ML to be used to their maximum potential to help streamline enterprise operations, reduce costs, reduce risk and increase profits, they need to be implemented with precision by those with realistic expectations. In 2019, Techopedia ran a two-part survey and quiz to help us examine how well industry executives comprehend AI and machine learning (ML). The results of our survey supported one clear conclusion: Business and industry executives do not understand the majority of AI and ML.
KID, DataRobot partnership makes data science accessible to every business
Amid soaring demand for tools to enable the data-driven organisation, a partnership between data specialists Knowledge Integration Dynamics (KID) and global AI cloud leader DataRobot is automating and democratising artificial intelligence (AI) and machine learning (ML), putting it into the hands of more South African businesses. Markus Top, who is heading up the partnership at KID, says it is a logical next step for KID, which has supported South African enterprises through their data journey for over 20 years. "Every business today wants to be data driven and embed AI at scale. However, until fairly recently achieving this has been a costly and time-consuming task," Top says. "With DataRobot, the manual, time-consuming processes within AI and ML projects are largely automated, allowing businesses to transform and innovate faster."
- Africa > South Africa > Western Cape > Cape Town (0.05)
- Africa > South Africa > Gauteng > Pretoria (0.05)
- Africa > South Africa > Gauteng > Johannesburg (0.05)
- Information Technology > Communications > Social Media (0.50)
- Information Technology > Artificial Intelligence > Machine Learning (0.35)
- Information Technology > Data Science > Data Mining (0.32)
The Most Fundamental Layer of MLOps -- Required Infrastructure
In my previous post, I discussed the three key components of an end-to-end MLOps solution: data and feature engineering pipelines, ML model training and retraining pipelines, and ML model serving pipelines. You can find the article here: Learn the Core of MLOps -- Building ML Pipelines. At the end of my last post, I briefly mentioned that the complexity of MLOps solutions can vary significantly from one to another, depending on the nature of the ML project and, more importantly, on the underlying infrastructure required. Therefore, in today's post, I will explain how the different levels of infrastructure required determine the complexity of MLOps solutions, and categorize MLOps solutions into different levels. More importantly, in my view, categorizing MLOps into different levels makes it easier for organizations of any size to adopt MLOps.
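The three components the post names can be sketched end to end in a few lines. Everything below is a dependency-free toy: the function names, the min-max scaling, and the threshold "model" are illustrative assumptions, not a real MLOps API:

```python
# Toy sketch of the three MLOps components: feature engineering pipeline,
# training pipeline, and serving step. All names and the "model" itself
# are illustrative; production stacks replace each with managed tooling.
def feature_pipeline(raw):
    # feature engineering: min-max scale values into [0, 1]
    lo, hi = min(raw), max(raw)
    return [(x - lo) / (hi - lo) for x in raw]

def train(features, labels):
    # "model": threshold at the smallest positive-class feature value
    positives = [f for f, l in zip(features, labels) if l == 1]
    return {"threshold": min(positives)}

def serve(model, feature):
    # serving: score a single incoming feature
    return 1 if feature >= model["threshold"] else 0

raw, labels = [10, 20, 30, 40], [0, 0, 1, 1]
feats = feature_pipeline(raw)
model = train(feats, labels)
preds = [serve(model, f) for f in feats]
print(preds)  # → [0, 0, 1, 1]
```

The point of separating the three callables is that each can be owned, versioned, and scaled independently, which is exactly where infrastructure requirements start to diverge between MLOps levels.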